A sentence signature replaces every word longer than three characters with its character length. Similar sentences therefore have similar signatures. For longer sentences at least, the converse may also hold: similar signatures are good candidates for locating similar sentences.
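The definition above can be illustrated with a minimal Python sketch. It is not the implementation used to build the corpus: the function name sentence_signature is hypothetical, and the sketch assumes that a "word" is a run of letters, digits, or underscores, that words of up to three characters are kept unchanged, and that punctuation and spacing are left as they are.

```python
import re

# Assumption: a "word" is a maximal run of \w characters (letters, digits, underscore).
_LONG_WORD = re.compile(r"\w{4,}")


def sentence_signature(sentence: str) -> str:
    """Replace every word longer than three characters with its length.

    Shorter words, punctuation, and whitespace are kept unchanged
    (an assumption of this sketch, not confirmed by the source).
    """
    return _LONG_WORD.sub(lambda match: str(len(match.group())), sentence)


if __name__ == "__main__":
    first = "The quick brown fox jumps over the lazy dog."
    second = "The small brown fox jumps over the lazy dog."
    print(sentence_signature(first))
    # The two sentences differ in one word but share a signature.
    print(sentence_signature(first) == sentence_signature(second))
```

Under these assumptions, two sentences that differ only in words of equal length receive the same signature, which is what makes signatures usable as a cheap similarity key.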
The first table gives the number of different signatures compared with the number of sentences:
The second table shows the most frequent signatures. Note, however, that short, frequent signatures may occur regardless of sentence similarity.
Sentence signatures are a tool for finding near-duplicate sentences.
They show their power only on large corpora, because small corpora usually contain few near-duplicates.
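As a rough sketch of how signatures might be used for this purpose, the Python snippet below groups sentences by signature and keeps only the groups that contain more than one distinct sentence. The function name near_duplicate_candidates and the (sentence, signature) row layout are assumptions made for illustration. The groups are candidates only: identical signatures do not guarantee similar sentences, especially for short ones.

```python
from collections import defaultdict


def near_duplicate_candidates(rows):
    """Group sentences by signature.

    rows: iterable of (sentence, signature) pairs (hypothetical layout).
    Returns a dict mapping each signature shared by at least two distinct
    sentences to the sorted list of those sentences.
    """
    groups = defaultdict(set)
    for sentence, signature in rows:
        groups[signature].add(sentence)
    return {
        signature: sorted(sentences)
        for signature, sentences in groups.items()
        if len(sentences) > 1
    }


# Toy data; signatures follow the rule "words longer than 3 characters -> length".
rows = [
    ("Police arrested three suspects on Monday.", "6 8 5 8 on 6."),
    ("Police arrested seven suspects on Friday.", "6 8 5 8 on 6."),
    ("It rained all day.", "It 6 all day."),
]
candidates = near_duplicate_candidates(rows)

if __name__ == "__main__":
    for signature, sentences in candidates.items():
        print(signature, sentences)
```

In a small sample like this one, most signatures occur only once, which matches the observation that the method pays off mainly on large corpora.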
Table 1:
select aa.a, bb.b, aa.a/bb.b from (select count(distinct signature_untok) as a from para_s) aa, (select count(*) as b from para_s) bb;
Table 2:
select signature_untok, count(*) as anz from para_s group by signature_untok order by anz desc limit 20;
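For readers without database access, the two queries can be approximated in plain Python. The sketch below assumes the signatures are available as a list of strings, one per sentence. The function name signature_statistics is hypothetical, and ties among equally frequent signatures may be ordered differently than in the database.

```python
from collections import Counter


def signature_statistics(signatures, top_n=20):
    """Return (distinct, total, ratio, most_frequent) for a list of signatures.

    distinct, total, and ratio correspond to the Table 1 query;
    most_frequent corresponds to the Table 2 query (top_n rows).
    """
    counts = Counter(signatures)
    distinct = len(counts)
    total = len(signatures)
    ratio = distinct / total if total else 0.0  # avoid division by zero
    return distinct, total, ratio, counts.most_common(top_n)


if __name__ == "__main__":
    sample = ["6 8 5 8 on 6.", "6 8 5 8 on 6.", "It 6 all day.", "Yes."]
    distinct, total, ratio, top = signature_statistics(sample)
    print(distinct, total, ratio)
    for signature, count in top:
        print(signature, count)
```

A ratio close to 1 means that almost every sentence has its own signature. Lower values indicate more repeated signatures and hence more near-duplicate candidates.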
How many near-duplicates can we expect in a large corpus?
4.7.1.2 Sentences with Most Frequent Sentence Signatures